SKIP SEQUENCING: A DECISION PROBLEM
IN QUESTIONNAIRE DESIGN 

By Charles F. Manski 1 and Francesca Molinari 2 

Northwestern University and Cornell University 

This paper studies questionnaire design as a formal decision prob- 
lem, focusing on one element of the design process: skip sequencing. 
We propose that a survey planner use an explicit loss function to 
quantify the trade-off between cost and informativeness of the sur- 
vey and aim to make a design choice that minimizes loss. We pose 
a choice between three options: ask all respondents about an item 
of interest, use skip sequencing, thereby asking the item only of re- 
spondents who give a certain answer to an opening question, or do 
not ask the item at all. The first option is most informative but also 
most costly. The use of skip sequencing reduces respondent burden 
and the cost of interviewing, but may spread data quality problems 
across survey items, thereby reducing informativeness. The last op- 
tion has no cost but is completely uninformative about the item of 
interest. We show how the planner may choose among these three 
options in the presence of two inferential problems, item nonresponse 
and response error. 

1. Introduction. Designing a questionnaire for administration to a sam- 
ple of respondents requires many decisions about the items to be asked, the 
wording and ordering of the questions, and so on. Considerable research has 
investigated the item response rates and patterns associated with alternative 
designs. See Krosnick (1999) for a recent review of the literature. Researchers 
have also called attention to the tension between the desire to reduce the 
costs and increase the informativeness of surveys. See, for example, Groves 
(1987) and Groves and Heeringa (2006). However, survey researchers have 
not studied questionnaire design as a formal decision problem in which one 
uses an explicit loss function to quantify the trade-off between cost and
informativeness and aims to make a design choice that minimizes loss. This
paper takes an initial step in that direction. We consider one element of the 
design problem, the use of skip sequencing. 

Skip sequencing is a widespread survey practice in which the response to 
an opening question is used to determine whether a respondent should be 
asked certain subsequent questions. The objective is to eliminate inappli- 
cable questions, thereby reducing respondent burden and the cost of inter- 
viewing. However, skip sequencing can amplify data quality problems. In 
particular, skip sequencing exacerbates the identification problems caused 
by item nonresponse and response errors. 

A respondent may not answer the opening question. When this happens, 
a common practice is to label the subsequent questions as inapplicable. How- 
ever, they may be applicable, in which case the item nonresponse problem is 
amplified. Another practice is to impute the answer to the opening question 
and, if the imputation is positive, to also impute answers to the subsequent 
questions. Some of these imputations will inevitably be incorrect. A partic- 
ularly odd situation occurs when the answer to the opening question should 
be negative but the imputation is positive. Then answers are imputed to 
subsequent questions that actually are inapplicable. 

A respondent may answer the opening question with error. An error may 
cause subsequent questions to be skipped, when they should be asked, or vice 
versa. An error of the first type induces nonresponse to the subsequent ques- 
tions. The consequences of an error of the second type depend on how the 
respondent answers the subsequent questions, having answered the opening 
one incorrectly. 

Illustration 1. The 2006 wave of the Health and Retirement Study 
(HRS) asked current Social Security recipients about their expectations for 
the future of the Social Security system. An opening question asked broadly: 
"Thinking of the Social Security program in general and not just your own 
Social Security benefits: On a scale from 0 to 100 (where 0 means no chance
and 100 means absolutely certain), what is the percent chance that Congress 
will change Social Security sometime in the next 10 years, so that it becomes 
less generous than now?" If the answer was a number greater than zero, 
a follow-up question asked "We just asked you about changes to Social 
Security in general. Now we would like to know whether you think these 
Social Security changes might affect your own benefits. On a scale from
0 to 100, what do you think is the percent chance that the benefits you
yourself are receiving from Social Security will be cut some time over the 
next 10 years?" If a person did not respond to the opening question or gave 
an answer of 0, the follow-up question was not asked. 

Illustration 2. The 1990 wave of the National Longitudinal Survey of
Older Men (NLSOM) queried respondents about their limitations in activi- 
ties of daily living (ADLs). An opening question asked broadly: "Because of 
a health or physical problem, do you ever need help from anyone in looking 
after personal care such as dressing, bathing, eating, going to the bathroom, 
or other such daily activities?" If the answer was positive, the respondent 
was then asked if he/she receives help from another person in each of six spe- 
cific ADLs (bathing/showering, dressing, eating, getting in or out of a chair 
or bed, walking, using the toilet). If the answer was negative or missing, the 
subsequent questions were skipped out. 

These illustrative uses of skip sequencing save survey costs by asking a 
broad question first and by following up with a more specific question only 
when the answer to the broad question meets specified criteria. However, 
nonresponse or response error to the opening question may compromise the 
quality of the data obtained. 

This paper studies skip sequencing as a decision problem in questionnaire 
design. We suppose that a survey planner is considering whether and how 
to ask about an item of interest. Three design options follow: 

Option All (A): ask all respondents the question. 

Option Skip (S): ask only those respondents who respond positively
to an opening question.

Option None (N): do not ask the question at all. 

These options vary in the cost of administering the questions and in the 
informativeness of the data they yield. Option (A) is most costly and is 
potentially most informative. Option (S) is less costly but may be less infor- 
mative if the opening question has nonresponse or response errors. Option 
(N) has no cost but is uninformative about the item of interest. We
suppose that the planner must choose among these options, weighing cost and
informativeness as he deems appropriate. We suggest an approach to this 
decision problem and give illustrative applications. 

The paper is organized as follows. As a prelude, Section 2 summarizes the 
few precedent studies that consider the data quality aspects of skip sequenc- 
ing. These studies do not analyze skip sequencing as a decision problem. 

Section 3 formalizes the problem of choice among design options. We 
assume that the survey planner wants to minimize a loss function whose 
value depends on the cost of a design option and its informativeness. Thus, 
evaluation of the design options requires that the planner measure their cost 
and informativeness. 

Suppose that a planner wants to combine sample data on an item with 
specified assumptions in order to learn about a population parameter of in- 
terest. When the sample size is large, we propose that informativeness be 
measured by the size of the identification region that a design option yields
for this parameter. As explained in Manski (2003), the identification region 
for the parameter is the set of values that remain feasible when unlimited ob- 
servations from the sampling process are combined with the maintained as- 
sumptions. The parameter is point-identified when this set contains a single 
value and is partially identified when the set is smaller than the parameter's 
logical range, but is not a single point. In survey settings with large samples 
of respondents, where identification rather than statistical inference is the 
dominant inferential problem, we think it natural to measure informative- 
ness by the size of the identification region. The smaller the identification 
region, the better. Section 6 discusses measurement of informativeness when 
the sample size is small. Then confidence intervals for the partially identified 
parameter may be used to measure informativeness. 

Sections 4 and 5 apply the general ideas of Section 3 in two polar settings 
having distinct inferential problems. Section 4 studies cases in which there 
may be nonresponse to the questions posed but it is assumed that there are 
no response errors. We first derive the identification regions under options 
A, S and N. We then show the circumstances in which a survey planner 
should choose each option. To illustrate, we consider choice among options 
for querying respondents about their expectations for future personal Social 
Security benefits. The HRS 2006 used skip sequencing, as described in Illus- 
tration 1. Another option would be to ask all respondents both the broad 
and the personal question. A third option would be to ask only the broad 
question, omitting the one about future personal benefits. 

Section 5 studies the other polar setting in which there is full response 
but there may be response errors. Again, we first derive the identification 
regions under the three design options and then show when a survey planner 
should choose each option. To illustrate, we consider choice among options 
for querying respondents about limitations in ADLs. The NLSOM used skip 
sequencing, as described in Illustration 2. Another survey, the 1993 wave of 
the Assets and Health Dynamics Among the Oldest Old (AHEAD) asked 
all respondents about a set of specific ADLs. A third option would be to not 
ask about specific ADLs at all. 

Section 6 concludes by calling for further analysis of questionnaire design 
as a decision problem. 

2. Previous studies of skip sequencing. As far as we are aware, there has 
been no precedent research studying skip sequencing as a decision problem 
in questionnaire design. Messmer and Seymour (1982) and Hill (1991, 1993) 
are the only precedent studies recognizing that skip sequencing may amplify 
data quality problems. 

Messmer and Seymour studied the effect of skip sequencing on item non- 
response in a large scale mail survey. Their analysis asked whether the dif- 
ficult structure of the survey, particularly the fact that respondents were 
instructed to skip to other questions perhaps several pages away in the
questionnaire, increased the number of unanswered questions. Their analy- 
sis indicates that branching instructions significantly increased the rate of 
item nonresponse for questions following a branch, and that this effect was 
higher for older individuals. This work is interesting but it does not have 
direct implications for modern surveys, where skip sequencing is automated 
rather than performed manually. 

Hill used data from five interview/reinterview sequence pairs in the 1984 
Survey of Income and Program Participation (SIPP) Reinterview Program. 
He examined data errors that manifest themselves through a discrepancy 
between the responses given in the two interviews, and categorized these 
discrepancies in three groups. In his terminology, a response discrepancy 
occurs when a different answer is recorded for an opening question in the 
interview and in the reinterview. A response induced sequencing discrepancy 
occurs when, as a consequence of different answers to the opening question, 
a subsequent question is asked in only one of the two interviews. A pro- 
cedurally induced sequencing discrepancy occurs when, in one of the two 
interviews but not both, an opening question is not asked and, therefore, 
the subsequent question is not asked either. 

Hill used a discrete contagious regression model to assess the relative 
importance of these errors in reducing data quality. The contagion process 
was used to express the idea that error spreads from one question to the 
next via skip sequencing. Within this model, the "conditional population at 
risk of contagion" expresses the idea that the number of remaining questions 
in the sequence at the point where the initiating error occurs gives an upper 
bound on the number of errors that can be induced. Hill's results suggest 
that the losses of data reliability caused by induced sequencing errors are 
at least as large as those induced by response errors. Moreover, the relative 
importance of sequencing errors strongly increases with the sequence length. 
This suggests that the reliability of individual items will be lower, all else 
equal, the later they appear in the sequence. 

3. A formal design problem. 

3.1. The choice setting. We pose here a formal questionnaire design 
problem that highlights how skip sequencing may affect data quality. To 
focus on this matter, we find it helpful to simplify the choice setting in three 
major respects. 

First, we suppose that a large random sample of respondents is drawn 
from a much larger population. This brings identification to the fore as the 
dominant inferential problem, the statistical precision of sample estimates 
receding into the background as a minor concern. We also suppose that 
all sample members agree to be interviewed. Hence, inferential problems 
arise only from item nonresponse and response errors, not from interview
nonresponse. 

Second, we perform a "marginalist" analysis that supposes the entire de- 
sign of the questionnaire has been set except for one item. The only decision 
is whether and how to ask about this item. Marginalist analysis enormously 
simplifies the decision problem. In practice, a survey planner must choose 
the entire structure of the questionnaire, and the choice made about one 
item may interact with choices made about others. We recognize this but, 
nevertheless, find it useful for exposition to focus on a single aspect of the 
global design problem, holding fixed the remainder of the questionnaire. 

Third, we assume that the design chosen for the specific item in our 
marginalist analysis affects only the informativeness of that item. In practice, 
the choice of how to ask a specific item affects the length of the entire survey, 
which may influence respondents' willingness or ability to provide reliable 
responses to other items. We recognize this but, nevertheless, find it useful 
for exposition to suppose that the effect on other items is negligible. 

Let y denote the item under consideration. As indicated in the Introduction, 
the design options are as follows: 

A: ask all respondents to report y. 

S: ask only those respondents who respond positively to an opening 
question. 

N: do not ask about y at all.

The population parameter of interest is labeled $\tau[P(y)]$, where P is the
population distribution of y. For example, $\tau[P(y)]$ might be the population
mean or median value of y.

3.2. Measuring the cost, informativeness, and loss of the design options. 
The design options differ in their costs and in their informativeness about
$\tau[P(y)]$. Abstractly, let $c_k$ denote the cost of option $k$, let $d_k$ denote its
informativeness, and let $L_k = L(c_k, d_k)$ be the loss that the survey planner
associates with option $k$. We suppose that the planner wants to choose a
design option that minimizes $L(c_k, d_k)$ over $k \in \{A, S, N\}$.

To operationalize this abstract optimization problem, a survey planner 
must decide how to measure loss, cost, and informativeness. Loss presumably 
increases with cost and decreases with informativeness. We will not be more 
specific about the form of the loss function here. We will, for simplicity, use 
a linear form in our applications. 

Cost presumably increases with the fraction of respondents who are asked 
the item. In some settings, cost may be proportional to this fraction. Then 
$c_k = \gamma f_k$, where $\gamma > 0$ is the cost per respondent of data collection and $f_k$
is the fraction of respondents asked the item under option $k$. It is the case
that $1 = f_A \ge f_S \ge f_N = 0$. Hence, $c_A = \gamma \ge c_S = \gamma f_S \ge c_N = 0$.

As indicated in the Introduction, we propose measurement of the
informativeness of a design option by the size of the identification region obtained
for the parameter of interest. In general, the size of an identification region 
depends on the specified parameter, the data produced by a design option, 
and the assumptions that the planner is willing to maintain. Sections 4 and 
5 show how in some leading cases. 

4. Question design with nonresponse. This section examines how non- 
response affects choice among the three design options. To focus attention 
on the inferential problem created by nonresponse, we assume that when 
sample members do respond, all answers are accurate. Section 4.1 considers 
identification of the parameter $\tau[P(y)]$. Section 4.2 shows how to use the
findings to choose a design. Section 4.3 uses questions on future generosity 
of Social Security to illustrate. 

4.1. Identification with nonresponse. It has been common in survey re- 
search to impute missing values and to use these imputations as if they are 
real data. Standard imputation methods presume that data are missing at 
random (MAR), conditional on specified observable covariates; see Little 
and Rubin (1987). If the maintained MAR assumptions are correct, then 
parameter $\tau[P(y)]$ is point-identified under both design options A and S.
Option S is less costly, so there is no reason to contemplate option A from 
the perspective of identification. If option A is used in practice, the rea- 
son must be to provide a larger sample of observations in order to improve 
statistical inference. 

Identification becomes the dominant concern when, as is often the case, a 
survey planner has only a weak understanding of the distribution of missing 
data. We focus here on the worst-case setting, in which the planner knows 
nothing at all about the missing data. It is straightforward to determine the 
identification region for $\tau[P(y)]$ under design options A and S. We draw on
Manski [(2003), Chapter 1] to show how. 

Option A. To formalize the identification problem created by nonresponse,
let each member $j$ of a population $J$ have an outcome $y_j$ in a space
$Y = [0, s]$. Here $s$ can be finite or can equal $\infty$, in which case $Y$ is the
nonnegative part of the extended real line. The assumption that $y$ is nonnegative
is not crucial for our analysis, but it simplifies the exposition and notation.

The population is a probability space and $y: J \to Y$ is a random variable
with distribution $P(y)$. Let a sampling process draw persons at random from
$J$. However, not all realizations of $y$ are observable. Let the realization of a
binary random variable $z_y^A$ indicate observability; $y$ is observable if $z_y^A = 1$
and not observable if $z_y^A = 0$. The superscript $A$ shows the dependence of
observability of $y$ on design option A.

By the Law of Total Probability,

(1) $P(y) = P(y \mid z_y^A = 1) P(z_y^A = 1) + P(y \mid z_y^A = 0) P(z_y^A = 0)$.

The sampling process reveals $P(y \mid z_y^A = 1)$ and $P(z_y^A)$, but it is uninformative
regarding $P(y \mid z_y^A = 0)$. Hence, the sampling process partially identifies $P(y)$.
In particular, it reveals that $P(y)$ lies in the identification region

(2) $H^A[P(y)] = \{P(y \mid z_y^A = 1) P(z_y^A = 1) + \psi P(z_y^A = 0),\ \psi \in \Psi_Y\}$.

Here $\Psi_Y$ is the space of all probability distributions on $Y$ and the superscript
$A$ on $H$ shows the dependence of the identification region on the design
option.

The identification region for a parameter of $P(y)$ follows immediately from
$H^A[P(y)]$. Consider inference on the parameter $\tau[P(y)]$. The identification
region consists of all possible values of the parameter. Thus,

(3) $H^A\{\tau[P(y)]\} \equiv \{\tau(\eta),\ \eta \in H^A[P(y)]\}$.

Result (3) is simple but is too abstract to be useful as stated. Research 
on partial identification has sought to characterize $H^A\{\tau[P(y)]\}$ for different
parameters. Manski (1989) does this for means of bounded functions of y, 
Manski (1994) for quantiles, and Manski [(2003), Chapter 1] for all parame- 
ters that respect first-order stochastic dominance. Blundell et al. (2007) and 
Stoye (2005) characterize the identification regions for spread parameters 
such as the variance, interquartile range and the Gini coefficient. 

The results for means of bounded functions are easy to derive and instruc- 
tive, so we focus on these parameters here. To further simplify the exposition, 
we restrict attention to monotone functions. Let $\Re$ be the extended real line.
Let $g(\cdot)$ be a monotone function that maps $Y$ into $\Re$ and that attains finite
lower and upper bounds $g_0 = \min_{y \in Y} g(y) = g(0)$ and $g_1 = \max_{y \in Y} g(y)$.
Without loss of generality, by a normalization, we set $g_0 = 0$ and $g_1 = 1$. The
problem of interest is to infer $E[g(y)]$.

The Law of Iterated Expectations gives

(4) $E[g(y)] = E[g(y) \mid z_y^A = 1] P(z_y^A = 1) + E[g(y) \mid z_y^A = 0] P(z_y^A = 0)$.

The sampling process reveals $E[g(y) \mid z_y^A = 1]$ and $P(z_y^A)$, but it is
uninformative regarding $E[g(y) \mid z_y^A = 0]$, which can take any value in the interval
$[0, 1]$. Hence, the identification region for $E[g(y)]$ is the closed interval

(5) $H^A\{E[g(y)]\} = [E[g(y) \mid z_y^A = 1] P(z_y^A = 1),\ E[g(y) \mid z_y^A = 1] P(z_y^A = 1) + P(z_y^A = 0)]$.

$H^A\{E[g(y)]\}$ is a proper subset of $[0, 1]$ whenever $P(z_y^A = 0)$ is less than one.
The width of the region is $P(z_y^A = 0)$. Thus, the severity of the identification
problem varies directly with the prevalence of missing data.
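
The worst-case interval for option A is easy to compute from two observable quantities. The following Python sketch illustrates the calculation; the function name, argument names, and numerical inputs are hypothetical and are not taken from the paper.

```python
def bounds_option_a(mean_g_respondents, p_nonresponse):
    """Worst-case bounds on E[g(y)] under option A, with g normalized to [0, 1].

    mean_g_respondents: mean of g(y) among those who answer the item.
    p_nonresponse:      probability that the item is not answered.
    """
    if not 0.0 <= mean_g_respondents <= 1.0:
        raise ValueError("mean_g_respondents must lie in [0, 1]")
    if not 0.0 <= p_nonresponse <= 1.0:
        raise ValueError("p_nonresponse must lie in [0, 1]")
    p_response = 1.0 - p_nonresponse
    # Lower bound: every nonrespondent has g(y) = 0.
    lower = mean_g_respondents * p_response
    # Upper bound: every nonrespondent has g(y) = 1.
    upper = lower + p_nonresponse
    return lower, upper


# Hypothetical inputs: respondent mean 0.5 and 10% item nonresponse.
lower_a, upper_a = bounds_option_a(0.5, 0.10)
print(lower_a, upper_a)
```

The width of the returned interval equals the nonresponse probability, whatever the respondent mean is.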

Option S. There are two sources of nonresponse under option S. First,
a sample member may not respond to the opening question, in which case 
she is not asked about item y. Second, a sample member may respond to 
the opening question but not to the subsequent question about item y. 

Let x denote the item whose value is sought in the opening question. 
As in Illustrations 1 and 2, we suppose that x is a broad item and that y 
is a more specific one. For simplicity, we suppose here that $x \in \{0, 1\}$ and
that $x = 0 \Rightarrow y = 0$. A respondent is asked about $y$ only if she answers
the opening question and reports x = 1. For example, consider Illustration 2 
discussed in the Introduction. If a respondent does not have any limitation 
in ADLs (x = 0), then clearly the respondent does not have a limitation in 
bathing/showering (y = 0). Hence, the NLSOM asks about y only when a 
respondent reports x = 1. 

To formalize the identification problem, we need two response indicators,
$z_x^S$ and $z_y^S$, the superscript $S$ showing the dependence of nonresponse on
design option S. Let $z_x^S = 1$ if a respondent answers the opening question and
let $z_x^S = 0$ otherwise. Let $z_y^S = 1$ if a respondent who is asked the follow-up
question gives a response, with $z_y^S = 0$ otherwise. Hence, $z_y^S = 1 \Rightarrow z_x^S = 1$.
This and the Law of Iterated Expectations and the fact that $g(0) = 0$ give

$E[g(y)] = E[g(y) \mid x = 1] P(x = 1) + E[g(y) \mid x = 0] P(x = 0)$
$= E[g(y) \mid x = 1, z_y^S = 1] P(z_y^S = 1, x = 1)$
$+ E[g(y) \mid x = 1, z_x^S = 1, z_y^S = 0] P(z_x^S = 1, z_y^S = 0, x = 1)$
$+ E[g(y) \mid x = 1, z_x^S = 0] P(z_x^S = 0, x = 1)$.

The sampling process reveals $E[g(y) \mid x = 1, z_y^S = 1]$, $P(z_x^S = 1, z_y^S = 0, x = 1)$,
and $P(z_y^S = 1) = P(z_y^S = 1, x = 1)$, where the last equality holds because
$z_y^S = 1 \Rightarrow x = 1$. The data are uninformative about
$E[g(y) \mid x = 1, z_x^S = 1, z_y^S = 0]$ and $E[g(y) \mid x = 1, z_x^S = 0]$, which can take
any values in $[0, 1]$. The data are partially informative about
$P(z_x^S = 0, x = 1)$, which can take any value in $[0, P(z_x^S = 0)]$. It follows
that the identification region for $E[g(y)]$ is the closed interval

(6) $H^S\{E[g(y)]\} = [E[g(y) \mid z_y^S = 1] P(z_y^S = 1),\ E[g(y) \mid z_y^S = 1] P(z_y^S = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)]$.

Thus, the severity of the identification problem varies directly with the
prevalence of nonresponse to the opening question and to the follow-up
question in the subpopulation in which it is asked.
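
The interval under skip sequencing can be computed in the same way from four observable quantities. The sketch below is only an illustration under the assumptions of this section (accurate responses, and a negative answer to the opening question implying y = 0); the function name, argument names, and numerical inputs are hypothetical.

```python
def bounds_option_s(mean_g_followup, p_followup_response,
                    p_followup_nonresponse, p_opening_nonresponse):
    """Worst-case bounds on E[g(y)] under option S, with g normalized to [0, 1].

    mean_g_followup:        mean of g(y) among those who answer the follow-up.
    p_followup_response:    probability of answering the follow-up question.
    p_followup_nonresponse: probability of answering the opening question
                            positively and then not answering the follow-up.
    p_opening_nonresponse:  probability of not answering the opening question.
    """
    # Lower bound: every unobserved value of g(y) equals 0.
    lower = mean_g_followup * p_followup_response
    # Upper bound: every unobserved value of g(y) equals 1. Opening-question
    # nonrespondents count in full because all of them may have x = 1.
    upper = lower + p_followup_nonresponse + p_opening_nonresponse
    return lower, upper


# Hypothetical inputs, not taken from any survey.
lower_s, upper_s = bounds_option_s(0.5, 0.60, 0.05, 0.10)
print(lower_s, upper_s)
```

The width of the interval is the sum of the two nonresponse probabilities, so nonresponse to the opening question widens the region for y even though y itself was never asked of those respondents.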






4.2. Choosing a design. Now consider choice among the three design 
options (A,S,N). The widths of the identification regions for E[g(y)] under 
these options are as follows: 

$d_A = P(z_y^A = 0)$, $d_S = P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)$, $d_N = 1$.

For specificity, let the loss function have the linear form $L_k = \gamma f_k + d_k$.
The first component measures survey cost and the second measures the
informativeness of the design option. We set the coefficient on $d_k$ equal to
one as a normalization of scale. The parameter $\gamma$ measures the importance
that the survey planner gives to cost relative to informativeness. There is 
no universally "correct" value of this parameter. Its value is something that 
the survey planner must specify, depending on the survey context and the 
nature of item y. 

It follows from the above and from the derivations of Section 4.1 that the 
losses associated with the three design options are as follows: 

$L_A = \gamma + P(z_y^A = 0)$,
$L_S = \gamma P(z_x^S = 1, x = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)$,
$L_N = 1$.

Thus, it is optimal to administer item $y$ to all sample members if

$\gamma + P(z_y^A = 0) < \min\{1,\ \gamma P(z_x^S = 1, x = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)\}$.

Skip sequencing is optimal if

$\gamma P(z_x^S = 1, x = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0) < \min\{1,\ \gamma + P(z_y^A = 0)\}$.

If neither of these inequalities holds, it is optimal not to ask the item at all.
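
The decision rule amounts to comparing three numbers. The following Python sketch applies the linear loss and the strict inequalities stated above; the function name, argument names, and inputs are hypothetical illustrations rather than values from any survey.

```python
def choose_design(gamma, d_all, f_skip, d_skip):
    """Pick a design option under the linear loss (cost weight * fraction asked + width).

    gamma:  weight the planner places on cost relative to informativeness.
    d_all:  width of the identification region under option A.
    f_skip: fraction of respondents asked the item under option S.
    d_skip: width of the identification region under option S.
    Returns the chosen option and the loss of each option.
    """
    losses = {
        "A": gamma * 1.0 + d_all,      # everyone is asked
        "S": gamma * f_skip + d_skip,  # only a fraction is asked
        "N": 1.0,                      # no cost, region is all of [0, 1]
    }
    if losses["A"] < min(losses["S"], losses["N"]):
        best = "A"
    elif losses["S"] < min(losses["A"], losses["N"]):
        best = "S"
    else:
        # Neither strict inequality holds, so the item is not asked.
        best = "N"
    return best, losses


# Hypothetical widths and fraction asked; only gamma varies.
for gamma in (0.1, 0.5, 2.0):
    option, losses = choose_design(gamma, d_all=0.10, f_skip=0.5, d_skip=0.20)
    print(gamma, option, losses)
```

With these inputs the choice moves from asking everyone, to skip sequencing, to not asking at all as the weight on cost rises.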

Determination of the optimal design option requires knowledge of the re- 
sponse rates that would occur under options A and S. This is where the 
body of survey research reviewed by Krosnick (1999) has a potentially im- 
portant role to play. Through the use of randomized experiments embedded 
in surveys, researchers have developed considerable knowledge of the re- 
sponse rates that occur when various types of questions are posed to diverse 
populations. In many cases, this body of knowledge can be brought to bear 
to provide credible values for the response rates that determine loss under 
options A and S. 

When the literature does not provide credible values for these response 
rates, a survey planner may want to perform his own pretest, randomly 
assigning sample members to options A and S. The size of the pretest sample
only needs to be large enough to determine with reasonable confidence which 
design option is best. It does not need to be large enough to give precise 
estimates of the response rates. 

4.3. Questioning about expectations on the generosity of Social Security.
Consider the questions on expectations for the future generosity of the Social 
Security program cited in Illustration 1. The opening question was posed to 
10,748 respondents to the 2006 HRS who currently receive Social Security
benefits, and the follow-up was asked of the subsample of 9356 persons who
answered the opening question and gave a response greater than zero. We 
assume here that the only data problem is nonresponse. The nonresponse 
rate to the opening question was 7.23%. The nonresponse rate to the follow- 
up question, for the subsample asked this question, was 2.27%. It is plausible 
that someone may not be willing to respond to the first question and yet 
be willing to respond to the second one. In particular, this would happen if 
a person does not want to speculate on what Congress will do but, never- 
theless, is sure that if Congress does act, it would only change benefits for 
future retirees, not for those already in the system. The HRS use of skip 
sequencing prevents observation of y in such cases. 

To cast this application into the notation of the previous section, we let 
x = 1 if a respondent places a positive probability on Congress acting, with
x = 0 otherwise. The rest of the notation is the same as above.

An early release of the HRS data provides these empirical values for the
quantities that determine the identification region for $E[g(y)]$ and loss under
design option S:

$P(z_x^S = 1, z_y^S = 0, x = 1) = 0.0197$
$P(z_x^S = 1, x = 1) = 0.8705$
$P(z_y^S = 1) = 0.8508$
$P(z_x^S = 0) = 0.0723$
$E[g(y) \mid z_y^S = 1] = 0.4039$

where $g(y) = y/100$. Hence, the identification region for $E[g(y)]$ under option
S is $H^S\{E[g(y)]\} = [0.3436, 0.4356]$ and loss is $L_S = 0.8705\gamma + 0.0920$.

The HRS data do not reveal the quantities that determine the identifica- 
tion region for E[g(y)] and loss under design option A. For this illustration, 
we conjecture that the mean response to item y that would be obtained under 
option A equals the mean response that is observed under option S. Thus, 
$E[g(y) \mid z_y^A = 1] = 0.4039$. We suppose further that the nonresponse
probability would be $P(z_y^A = 0) = 0.08$. Then the identification region for $E[g(y)]$
under option A is $H^A\{E[g(y)]\} = [0.3716, 0.4516]$ and loss is $L_A = \gamma + 0.08$.

It follows from the above that it is optimal to administer item $y$ to all
sample members if

$\gamma < 0.0927$.

Skip sequencing is optimal if

$0.0927 < \gamma < 1.0431$.

If neither of these inequalities holds, it is optimal not to ask the item at all.
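
The two cutoffs for the cost weight follow from equating the linear losses pairwise. The short Python check below recomputes them from the HRS quantities reported above and the conjectured nonresponse probability of 0.08 under option A; the variable names are hypothetical.

```python
# Quantities reported in the text for option S (HRS 2006, early release).
f_skip = 0.8705               # fraction of respondents asked the follow-up
d_skip = 0.0197 + 0.0723      # width of the identification region under S
# Conjectured in the text for option A.
d_all = 0.08                  # width of the identification region under A

# Loss under A is below loss under S when
#   gamma + d_all < gamma * f_skip + d_skip,
# that is, when gamma < (d_skip - d_all) / (1 - f_skip).
gamma_all_vs_skip = (d_skip - d_all) / (1.0 - f_skip)

# Loss under S is below loss under N (which equals 1) when
#   gamma * f_skip + d_skip < 1,
# that is, when gamma < (1 - d_skip) / f_skip.
gamma_skip_vs_none = (1.0 - d_skip) / f_skip

# Loss under A is below loss under N when gamma < 1 - d_all.
gamma_all_vs_none = 1.0 - d_all

print(round(gamma_all_vs_skip, 4), round(gamma_skip_vs_none, 4))
```

The comparison of option A with option N gives a cutoff of 0.92, which is not binding here because the cutoff against option S is much smaller.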

5. Question design with data errors. This section examines how re- 
sponse errors affect choice among the three design options. To focus at- 
tention on the inferential problem created by such errors, we assume that 
all sample members respond to the questions posed. Section 5.1 considers 
identification. Section 5.2 shows how to use the findings to choose a design. 
Section 5.3 uses questions on limitations in ADLs to illustrate. 

5.1. Identification with response errors. Section 4 showed that assump- 
tions about the distribution of missing data are unnecessary for partially 
informative inference in the presence of nonresponse. In contrast, assump- 
tions on the nature or prevalence of response errors are a prerequisite for 
inference. In cases where y is discrete, it is natural to think of data errors 
as classification errors. We conceptualize response error here through a mis- 
classification model previously used by Molinari (2003, 2008), and we draw 
on her findings. The Appendix discusses the mixture model of data errors, 
which yields equivalent results beginning from a different conceptualization 
of data errors. 

The misclassification model is a simple formalism that does not have con- 
tent per se. It becomes informative when it is combined with an assumed 
upper bound on the prevalence of data errors. When such a bound is avail- 
able, Molinari (2003) showed that E[g(y)] is partially identified under design 
option A. It is straightforward to show the same under option S. To simplify 
the exposition, we focus here on the particularly simple case where $y \in \{0, 1\}$
and $g(y) = y$. Corresponding results for general discrete $Y$ and any bounded
function $g(\cdot): Y \to [0, 1]$ may be obtained from the authors.

Option A. As in Section 4, let each member $j$ of a population $J$ have
an outcome $y_j$ and let $P(y)$ be the population distribution of $y$. Let a
sampling process draw persons at random from $J$. Let $\tilde{y}: J \to Y$ denote the
responses that population members would give when queried about $y$. The
researcher observes realizations of $\tilde{y}$, which can either equal or differ from
the corresponding realizations of $y$. When $\tilde{y} \ne y$, data errors occur.

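The misclassification model begins with the basic observation that, by the
Law of Total Probability,

$\begin{bmatrix} P(\tilde{y}^A = 1) \\ P(\tilde{y}^A = 0) \end{bmatrix} = \begin{bmatrix} P(\tilde{y}^A = 1 \mid y = 1) & P(\tilde{y}^A = 1 \mid y = 0) \\ P(\tilde{y}^A = 0 \mid y = 1) & P(\tilde{y}^A = 0 \mid y = 0) \end{bmatrix} \begin{bmatrix} P(y = 1) \\ P(y = 0) \end{bmatrix}$.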
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The superscript A shows that the response y A depends on design option 
A. The sampling process reveals only P(y A ), which per se is uninformative 
about -P(y). The basic maintained assumption is a known nontrivial lower 
bound 1 — X A > on the probability that the realizations of y A and y coin- 
cide, or, strengthening this assumption, a known nontrivial lower bound on 
the probability of correct report for each value that y can take. Formally, 
these assumptions are as follows: 

Assumption 1. P(y = y A ) > 1 - X A > 0. 

Assumption 2. P(y A = k\y = k) > 1 - X A > 0, V k e Y. 

Molinari (2003) shows that, under Assumption 1,

$$H^A[P(y = 1)] = [0, 1] \cap \left[P(\tilde{y}^A = 1) - \lambda^A,\; P(\tilde{y}^A = 1) + \lambda^A\right], \tag{7}$$

while, under Assumption 2,

$$H^A[P(y = 1)] = [0, 1] \cap \left[\frac{P(\tilde{y}^A = 1) - \lambda^A}{1 - \lambda^A},\; \frac{P(\tilde{y}^A = 1)}{1 - \lambda^A}\right]. \tag{8}$$

Observe that these identification regions yield informative lower and upper
bounds on $P(y = 1)$ when $\lambda^A < P(\tilde{y}^A = 1) < 1 - \lambda^A$.
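The two regions are simple enough to evaluate directly. The following is a minimal sketch, not code from the paper: the function name `identification_region`, its argument names, and the numerical inputs are all hypothetical and chosen only for illustration. It intersects the intervals of equations (7) and (8) with [0, 1].

```python
def identification_region(p_reported, lam, per_value=False):
    """Return (lower, upper) bounds on P(y = 1).

    p_reported : reported probability that the response equals 1
    lam        : assumed upper bound on the prevalence of data errors
    per_value  : False uses the form of equation (7) (Assumption 1);
                 True uses the form of equation (8) (Assumption 2).
    """
    if not 0.0 <= p_reported <= 1.0:
        raise ValueError("p_reported must lie in [0, 1]")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    if per_value:
        lower = (p_reported - lam) / (1.0 - lam)
        upper = p_reported / (1.0 - lam)
    else:
        lower = p_reported - lam
        upper = p_reported + lam
    # Intersect with [0, 1], as in equations (7) and (8).
    return max(0.0, lower), min(1.0, upper)


# Hypothetical inputs, for illustration only.
print(identification_region(0.5, 0.1))                  # two-sided informative bounds
print(identification_region(0.5, 0.1, per_value=True))  # narrower region under Assumption 2
print(identification_region(0.05, 0.1))                 # lower bound truncated at 0
```

In the third call the reported probability falls below the error bound, so only the upper bound is informative.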

Results (7) and (8) were derived earlier by Horowitz and Manski (1995), 
using a different formalization of data errors. They studied partial identifi- 
cation of probability distributions under the mixture model of data errors 
used in studies of robust inference following Huber (1964). Their main as- 
sumption was the availability of an upper bound on the prevalence of data 
errors as defined in the mixture model, just as Huber assumed in his seminal 
research. See the Appendix for further discussion of the relationship between 
the mixture model and the misclassification model. 



Option S. There are two sources of potential response error under option 
S. First, a sample member may respond with error to the opening question. 
Then she is erroneously not asked the follow up question if she gives a false 
negative answer, and she is erroneously asked the follow up question if she 
gives a false positive answer. Second, a sample member may (truthfully) 
respond affirmatively to the opening question and then respond with error 
to the follow up. 

As in Section 4, we let $y$ denote the true value of the variable of interest
and $x$ denote the true value of the variable elicited in the opening question.
The error-ridden versions of these variables are $\tilde{y}^S$ and $\tilde{x}^S$, respectively. As in
Section 4, skip sequencing has certain logical implications when the opening
question inquires broadly about a subject and the follow up inquires more
specifically. These logical relations are $x = 0 \Longrightarrow y = 0$ and $\tilde{x}^S = 0 \Longrightarrow \tilde{y}^S = 0$.

The misclassification model begins with the observation that, by the Law
of Total Probability,

$$P(\tilde{x}^S = i, \tilde{y}^S = k) = \sum_{l = 0, 1} \sum_{m = 0, 1} P(\tilde{x}^S = i, \tilde{y}^S = k \mid x = l, y = m)\, P(x = l, y = m), \qquad i, k \in \{0, 1\}.$$

The sampling process reveals only the quantities $P(\tilde{x}^S = i, \tilde{y}^S = k)$ on the
left-hand side of these equations, with the logic of skip sequencing implying
that $P(\tilde{x}^S = 1, \tilde{y}^S = 1) = P(\tilde{y}^S = 1)$, $P(\tilde{x}^S = 0, \tilde{y}^S = 0) = P(\tilde{x}^S = 0)$ and
$P(\tilde{x}^S = 0, \tilde{y}^S = 1) = 0$. The logic of skip sequencing also implies that
$P(x = 1, y = 1) = P(y = 1)$, $P(x = 0, y = 0) = P(x = 0)$ and $P(x = 0, y = 1) = 0$.

The observable quantities and logical restrictions per se are uninformative
about $P(y)$, but they become informative when combined with these
extensions of Assumptions 1 and 2:

Assumption 3. $P(x = \tilde{x}^S, y = \tilde{y}^S) \geq 1 - \lambda^S > 0$.

Assumption 4. $P(\tilde{x}^S = i, \tilde{y}^S = k \mid x = i, y = k) \geq 1 - \lambda^S > 0$, $i, k \in \{0, 1\}$,
$k \leq i$.

Extension of the argument of Molinari (2003) shows that, under Assumption 3,

$$H^S[P(y = 1)] = [0, 1] \cap \left[P(\tilde{y}^S = 1) - \lambda^S,\; P(\tilde{y}^S = 1) + \lambda^S\right], \tag{9}$$

while, under Assumption 4,

$$H^S[P(y = 1)] = [0, 1] \cap \left[\frac{P(\tilde{y}^S = 1) - \lambda^S}{1 - \lambda^S},\; \frac{P(\tilde{y}^S = 1)}{1 - \lambda^S}\right]. \tag{10}$$

These identification regions yield informative lower and upper bounds on
$P(y = 1)$ when $\lambda^S < P(\tilde{y}^S = 1) < 1 - \lambda^S$.
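A short sketch may help show why the endpoints in (10) are valid bounds. It uses only Assumption 4, the Law of Total Probability, and the logical restrictions of skip sequencing stated above. It does not attempt to show that the bounds are sharp, which is the content of the cited argument.

```latex
\begin{align*}
P(\tilde{y}^S = 1) &= P(\tilde{x}^S = 1, \tilde{y}^S = 1) \\
  &\geq P(\tilde{x}^S = 1, \tilde{y}^S = 1 \mid x = 1, y = 1)\, P(x = 1, y = 1) \\
  &\geq (1 - \lambda^S)\, P(y = 1), \\[4pt]
P(\tilde{y}^S = 0) &= P(\tilde{x}^S = 0, \tilde{y}^S = 0) + P(\tilde{x}^S = 1, \tilde{y}^S = 0) \\
  &\geq P(\tilde{x}^S = 0, \tilde{y}^S = 0 \mid x = 0, y = 0)\, P(x = 0, y = 0)
      + P(\tilde{x}^S = 1, \tilde{y}^S = 0 \mid x = 1, y = 0)\, P(x = 1, y = 0) \\
  &\geq (1 - \lambda^S)\, P(y = 0).
\end{align*}
```

The first chain gives $P(y = 1) \leq P(\tilde{y}^S = 1)/(1 - \lambda^S)$. The second gives $1 - P(y = 1) \leq [1 - P(\tilde{y}^S = 1)]/(1 - \lambda^S)$, which rearranges to $P(y = 1) \geq [P(\tilde{y}^S = 1) - \lambda^S]/(1 - \lambda^S)$.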

Whereas Assumptions 1 and 2 only concerned the coincidence of the true
and reported values of $y$, Assumptions 3 and 4 concern the joint coincidence
of the true and reported values of $(x, y)$. Hence, it is reasonable to think
that a survey planner will ordinarily specify a higher lower bound in the
first case than the second; that is, $1 - \lambda^A > 1 - \lambda^S$.

5.2. Choosing a design. Now consider choice among the three design
options. The width of the identification region for $P(y = 1)$ under option N
remains $d_N = 1$, and therefore, the loss associated with this option is $L_N = 1$.

For simplicity, we focus here on the case when the identification regions
under options A and S yield informative lower and upper bounds; that is,
$\lambda^k < P(\tilde{y}^k = 1) < 1 - \lambda^k$, $k \in \{A, S\}$. Table 1 contains the results for other
cases.

Under Assumptions 1 and 3, the widths of the identification regions for
$P(y = 1)$, under design options A and S, are $d_k = 2\lambda^k$, $k \in \{A, S\}$. Therefore,
the losses associated with these two design options are

$$L_A = \gamma + 2\lambda^A, \qquad L_S = \gamma P(\tilde{x}^S = 1) + 2\lambda^S.$$

Thus, it is optimal to ask all sample members about item $y$ if

$$\gamma + 2\lambda^A < \min\{1,\; \gamma P(\tilde{x}^S = 1) + 2\lambda^S\}.$$

Skip sequencing is optimal if

$$\gamma P(\tilde{x}^S = 1) + 2\lambda^S < \min\{1,\; \gamma + 2\lambda^A\}.$$

If neither of these inequalities holds, it is optimal not to ask the item at all.

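Under Assumptions 2 and 4, the widths of the identification regions for
$P(y = 1)$ are $d_k = \lambda^k / (1 - \lambda^k)$, $k \in \{A, S\}$. Therefore, the losses are

$$L_A = \gamma + \frac{\lambda^A}{1 - \lambda^A}, \qquad L_S = \gamma P(\tilde{x}^S = 1) + \frac{\lambda^S}{1 - \lambda^S}.$$

Thus, it is optimal to ask all sample members about item $y$ if

$$\gamma + \frac{\lambda^A}{1 - \lambda^A} < \min\left\{1,\; \gamma P(\tilde{x}^S = 1) + \frac{\lambda^S}{1 - \lambda^S}\right\}.$$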
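Table 1

Value of $L_k$ depending on the relationship between $\lambda^k$ and $P(\tilde{y}^k = 1)$, $k \in \{A, S\}$

Case                                                        Assumptions 1 and 3                                          Assumptions 2 and 4

$1 - \lambda^A < P(\tilde{y}^A = 1) < \lambda^A$            $L_A = \gamma + 1$                                           $L_A = \gamma + 1$
$P(\tilde{y}^A = 1) < \min\{\lambda^A, 1 - \lambda^A\}$     $L_A = \gamma + P(\tilde{y}^A = 1) + \lambda^A$              $L_A = \gamma + \frac{P(\tilde{y}^A = 1)}{1 - \lambda^A}$
$\lambda^A < P(\tilde{y}^A = 1) < 1 - \lambda^A$            $L_A = \gamma + 2\lambda^A$                                  $L_A = \gamma + \frac{\lambda^A}{1 - \lambda^A}$
$P(\tilde{y}^A = 1) > \max\{\lambda^A, 1 - \lambda^A\}$     $L_A = \gamma + 1 - P(\tilde{y}^A = 1) + \lambda^A$          $L_A = \gamma + \frac{1 - P(\tilde{y}^A = 1)}{1 - \lambda^A}$

$1 - \lambda^S < P(\tilde{y}^S = 1) < \lambda^S$            $L_S = \gamma\delta^S + 1$                                   $L_S = \gamma\delta^S + 1$
$P(\tilde{y}^S = 1) < \min\{\lambda^S, 1 - \lambda^S\}$     $L_S = \gamma\delta^S + P(\tilde{y}^S = 1) + \lambda^S$      $L_S = \gamma\delta^S + \frac{P(\tilde{y}^S = 1)}{1 - \lambda^S}$
$\lambda^S < P(\tilde{y}^S = 1) < 1 - \lambda^S$            $L_S = \gamma\delta^S + 2\lambda^S$                          $L_S = \gamma\delta^S + \frac{\lambda^S}{1 - \lambda^S}$
$P(\tilde{y}^S = 1) > \max\{\lambda^S, 1 - \lambda^S\}$     $L_S = \gamma\delta^S + 1 - P(\tilde{y}^S = 1) + \lambda^S$  $L_S = \gamma\delta^S + \frac{1 - P(\tilde{y}^S = 1)}{1 - \lambda^S}$

Note. $\delta^S \equiv P(\tilde{x}^S = 1)$.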






Skip sequencing is optimal if

$$\gamma P(\tilde{x}^S = 1) + \frac{\lambda^S}{1 - \lambda^S} < \min\left\{1,\; \gamma + \frac{\lambda^A}{1 - \lambda^A}\right\}.$$

If neither of these inequalities holds, it is optimal not to ask the item at all.
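The decision rule can be sketched in a few lines of code. This is an illustrative sketch rather than the authors' implementation: the helper names `region_width` and `choose_design`, the argument names, and the numerical inputs in the usage lines are hypothetical. The widths are computed by intersecting the relevant interval with [0, 1], so the sketch also covers the non-informative cases of Table 1.

```python
def region_width(p_reported, lam, per_value):
    """Width of [0, 1] intersected with the interval in (7)/(9) or (8)/(10)."""
    if per_value:  # Assumptions 2 and 4
        lower = (p_reported - lam) / (1.0 - lam)
        upper = p_reported / (1.0 - lam)
    else:          # Assumptions 1 and 3
        lower = p_reported - lam
        upper = p_reported + lam
    return min(1.0, upper) - max(0.0, lower)


def choose_design(gamma, p_y_A, p_y_S, p_x_S, lam_A, lam_S, per_value=False):
    """Return the chosen option ('A', 'S' or 'N') and the three losses."""
    loss_A = gamma + region_width(p_y_A, lam_A, per_value)
    loss_S = gamma * p_x_S + region_width(p_y_S, lam_S, per_value)
    loss_N = 1.0
    # Strict inequalities, as in the text: ties fall through to option N.
    if loss_A < min(loss_N, loss_S):
        best = "A"
    elif loss_S < min(loss_N, loss_A):
        best = "S"
    else:
        best = "N"
    return best, {"A": loss_A, "S": loss_S, "N": loss_N}


# Hypothetical inputs: as the cost weight gamma grows, the choice moves A -> S -> N.
for gamma in (0.05, 0.5, 3.0):
    print(gamma, choose_design(gamma, 0.5, 0.5, 0.6, 0.10, 0.15)[0])
```

Handling ties by falling through to option N mirrors the strict inequalities literally; a planner who is indifferent between two options could break ties differently.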

Determination of the optimal design option requires information on the 
nature and prevalence of response errors under options A and S. There have 
been occasional validation and reliability studies documenting the extent 
of measurement error in survey items; see, for example, Groves (1989) and 
Bound, Brown and Mathiowetz (2001). When the literature does not provide 
credible upper bounds for the probability of data errors, a survey planner 
may want to perform his own pretest, randomly assigning sample members 
to options A and S, and then obtain corresponding validation or reliability 
data. As in Section 4, the size of the pretest sample only needs to be large 
enough to determine with reasonable confidence which design option is best. 
It does not need to be large enough to give precise estimates of the upper 
bounds on the probabilities of data errors. 

5.3. Questioning about limitations in ADLs. Consider the questions on 
limitations in ADLs cited in Illustration 2. The opening question was posed 
to 2092 respondents to the 1990 NLSOM, of whom 92.45% were self re- 
spondents and 7.55% were proxy respondents. The follow-ups were asked to 
the 192 persons who responded to the opening question and gave an affir- 
mative answer. We focus here on the first follow-up ADL question: "Now I 
would like to be more specific. Because of a health or physical problem, do 
you receive help from another person in bathing or showering?" The non- 
response rate to the opening question was 0.62%. The nonresponse rate to 
the follow-up question, for the subsample asked this question, was 0.52%. 
Given these minimal nonresponse rates, we abstract from nonresponse here 
and concentrate our attention on response error. 

To keep this illustration simple, we suppose here that the question on 
bathing or showering is the only follow up to the NLSOM opening ques- 
tion on limitations in ADLs. A more realistic analysis would jointly consider 
the six follow up questions that actually appear in the survey. This is a 
straightforward extension of our analysis if one maintains the "marginalist" 
assumption that the design chosen for the set of ADL items does not affect 
data quality elsewhere in the survey. We think this assumption reasonable, 
because the NLSOM contains only six easily understood questions on lim- 
itations in specific ADLs. Item nonresponse to these questions is minimal. 
Item nonresponse also was minimal when similar questions were asked in 
the AHEAD survey, described below, which does not use skip sequencing. 

We caution that there are circumstances in which skip sequencing avoids
having to ask some respondents a long, laborious sequence of irrelevant questions.
When this is the case one may, as noted in Section 3.1, think that the
skip sequencing decision may materially affect respondents' willingness or
ability to provide reliable responses throughout the survey. When respondent
burden is a potential concern, one may find it necessary to move away
from simple marginalist analysis of the type we perform and instead treat
the design of the entire questionnaire as a complex joint decision problem.

For this illustration, we take the parameter of interest to be the cross- 
sectional probability P(y = 1) that an individual in the population rep- 
resented by the NLSOM needs help in bathing/showering. This is one of 
several parameters of potential interest when studying limitations in ADL. 
Connor et al. (2006) emphasize the importance of longitudinal measurement 
of the duration of disability and of transitions in and out of disability. Concern
with these matters might lead one to be interested in $P[y(t) - y(t - k)]$
or $P[y(t) \mid y(t - k)]$, where $y(t)$ and $y(t - k)$ measure limitations in ADLs at
two interviews spaced k years apart. It would be of interest to characterize 
the identification regions for these transition parameters under alternative 
questionnaire designs. 

Consider $P(y = 1)$. The reported probability is $P(\tilde{y}^S = 1) = 0.073$. To apply
the misclassification model, we need to set values for the upper bounds
$\lambda^A$ and $\lambda^S$ on the probability of occurrence of data errors under options A
and S. We are not aware of validation studies placing upper bounds on the 
probability of data errors in self reports of limitations in ADLs for popu- 
lations similar to the one surveyed by the NLSOM, under design option S. 
However, there have been studies that compare self reports and proxy re- 
ports, as well as some that assess the time series consistency of self reports 
across interviews. Most of this work analyzes surveys in which the ques- 
tionnaire uses design option A. See, for example, Rubenstein et al. (1984), 
Mathiowetz and Groves (1985), Moore (1988), Mathiowetz and Lair (1994), 
Rodgers and Miller (1997), Mathiowetz and Wunderlich (2000) and Miller 
and DeMaio (2006). In particular, Rubenstein et al. (1984) and Miller and 
DeMaio (2006) report the results of reliability studies providing information 
on the prevalence of data errors. 

Rubenstein et al. (1984) analyze two samples of individuals, one providing
data on hospitalized elderly persons and the other on nursing home residents.
They compare the reports of limitations in ADLs and additional daily activities
(such as telephoning, shopping, handling finances, and cooking) of
the institutionalized elderly persons and of a "community proxy" (a spouse, child,
or close friend) with those of a nurse proxy. If one assumes that the report
of the nurse proxy is always correct, one can conclude from this study that
the probability of a data error is bounded above by 0.36. Miller and DeMaio
(2006) analyze data on limitations in bathing/showering collected in the
2006 administration of the American Community Survey Content Test. Reliability
estimates based on reinterviews suggest a probability of data errors
of at most 0.17.

The sampling frame and questionnaire design of the NLSOM differ from
the ones analyzed in these reliability studies. Hence, their findings can only
be suggestive for our purposes. In what follows we use the bounds in Assumption 5
below. Table 2 collects the results obtained using different values
of $\lambda^A$ and $\lambda^S$, which encompass the upper bounds on probabilities of data
errors reported by Rubenstein et al. (1984) and Miller and DeMaio (2006).

Assumption 5. $\lambda^A = 0.15$, $\lambda^S = 0.25$.

The identification regions for $P(y = 1)$ under design options A and S are
given in Table 1. [The forms given in Section 5.2 do not apply here because
the inequalities $\lambda^k < P(\tilde{y}^k = 1) < 1 - \lambda^k$, $k \in \{A, S\}$, do not hold in this
application.] Using $\lambda^S = 0.25$ as the upper bound on data errors under design
option S, the identification region for $P(y = 1)$ is $H^S[P(y = 1)] = [0, 0.3230]$
under Assumption 3 and $H^S[P(y = 1)] = [0, 0.0973]$ under Assumption 4.
The data reveal that $P(\tilde{x}^S = 1) = 0.092$. Hence, loss is $L_S = 0.092\gamma + 0.3230$
under Assumption 3, and $L_S = 0.092\gamma + 0.0973$ under Assumption 4.

The NLSOM data do not reveal the quantity $P(\tilde{y}^A = 1)$ needed to determine
the identification region for $P(y = 1)$ under design option A. For this
illustration, we conjecture that the rate of reported limitations in bathing/
showering that would be obtained under option A equals the rate that is
observed under option S. Thus, $P(\tilde{y}^A = 1) = 0.073$. Using $\lambda^A = 0.15$ as the
upper bound on data errors under option A, the identification region for
$P(y = 1)$ is $H^A[P(y = 1)] = [0, 0.2230]$ under Assumption 1 and
$H^A[P(y = 1)] = [0, 0.0859]$ under Assumption 2. Hence, loss is $L_A = \gamma + 0.2230$ under
Assumption 1 and $L_A = \gamma + 0.0859$ under Assumption 2.
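The arithmetic behind these four regions can be written out as follows. This is a worked restatement using the reported rate 0.073 and the bounds of Assumption 5; it adds nothing beyond the interval forms of the regions under Assumptions 1 to 4.

```latex
\begin{align*}
\text{Assumption 3:}\quad & [0,1] \cap [\,0.073 - 0.25,\; 0.073 + 0.25\,] = [0,\; 0.323], \\
\text{Assumption 4:}\quad & [0,1] \cap \left[\tfrac{0.073 - 0.25}{0.75},\; \tfrac{0.073}{0.75}\right] \approx [0,\; 0.0973], \\
\text{Assumption 1:}\quad & [0,1] \cap [\,0.073 - 0.15,\; 0.073 + 0.15\,] = [0,\; 0.223], \\
\text{Assumption 2:}\quad & [0,1] \cap \left[\tfrac{0.073 - 0.15}{0.85},\; \tfrac{0.073}{0.85}\right] \approx [0,\; 0.0859].
\end{align*}
```

In every case the unconstrained lower endpoint is negative and is truncated at 0, so the width of each region equals its upper endpoint.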

It follows that it is optimal to ask all sample members about item $y$ if

$$\begin{aligned}
\gamma + 0.2230 < \min\{1,\; 0.092\gamma + 0.3230\} &\iff \gamma < 0.1101 && \text{under Assumptions 1 and 3,} \\
\gamma + 0.0859 < \min\{1,\; 0.092\gamma + 0.0973\} &\iff \gamma < 0.0126 && \text{under Assumptions 2 and 4.}
\end{aligned}$$

Skip sequencing is optimal if

$$\begin{aligned}
0.092\gamma + 0.3230 < \min\{1,\; \gamma + 0.2230\} &\iff 0.1101 < \gamma < 7.3587 && \text{under Assumptions 1 and 3,} \\
0.092\gamma + 0.0973 < \min\{1,\; \gamma + 0.0859\} &\iff 0.0126 < \gamma < 9.8116 && \text{under Assumptions 2 and 4.}
\end{aligned}$$

Otherwise, it is optimal not to ask the item at all.

Table 2

Values of $\gamma$ that determine the choice of a certain design option, depending on $(\lambda^A, \lambda^S)$

                            Assumptions 1 and 3                                Assumptions 2 and 4
$\lambda^A$   $\lambda^S$   Option A is chosen   Option S is chosen            Option A is chosen   Option S is chosen

0.100         0.100         Never                $0.000 < \gamma < 8.989$      Never                $0.000 < \gamma < 9.988$
0.100         0.125         $\gamma < 0.027$     $0.027 < \gamma < 8.717$      $\gamma < 0.003$     $0.003 < \gamma < 9.963$
0.100         0.170         $\gamma < 0.077$     $0.077 < \gamma < 8.228$      $\gamma < 0.007$     $0.007 < \gamma < 9.914$
0.100         0.200         $\gamma < 0.110$     $0.110 < \gamma < 7.902$      $\gamma < 0.011$     $0.011 < \gamma < 9.878$
0.100         0.360         $\gamma < 0.286$     $0.286 < \gamma < 6.163$      $\gamma < 0.036$     $0.036 < \gamma < 9.630$
0.100         0.400         $\gamma < 0.330$     $0.330 < \gamma < 5.728$      $\gamma < 0.045$     $0.045 < \gamma < 9.547$

0.125         0.125         Never                $0.000 < \gamma < 8.717$      Never                $0.000 < \gamma < 9.963$
0.125         0.170         $\gamma < 0.050$     $0.050 < \gamma < 8.228$      $\gamma < 0.005$     $0.005 < \gamma < 9.914$
0.125         0.200         $\gamma < 0.083$     $0.083 < \gamma < 7.902$      $\gamma < 0.008$     $0.008 < \gamma < 9.878$
0.125         0.360         $\gamma < 0.259$     $0.259 < \gamma < 6.163$      $\gamma < 0.034$     $0.034 < \gamma < 9.630$
0.125         0.400         $\gamma < 0.303$     $0.303 < \gamma < 5.728$      $\gamma < 0.042$     $0.042 < \gamma < 9.547$

0.170         0.170         Never                $0.000 < \gamma < 8.228$      Never                $0.000 < \gamma < 9.914$
0.170         0.200         $\gamma < 0.033$     $0.033 < \gamma < 7.902$      $\gamma < 0.004$     $0.004 < \gamma < 9.878$
0.170         0.360         $\gamma < 0.209$     $0.209 < \gamma < 6.163$      $\gamma < 0.029$     $0.029 < \gamma < 9.630$
0.170         0.400         $\gamma < 0.253$     $0.253 < \gamma < 5.728$      $\gamma < 0.037$     $0.037 < \gamma < 9.547$

0.200         0.200         Never                $0.000 < \gamma < 7.902$      Never                $0.000 < \gamma < 9.878$
0.200         0.360         $\gamma < 0.176$     $0.176 < \gamma < 6.163$      $\gamma < 0.025$     $0.025 < \gamma < 9.630$
0.200         0.400         $\gamma < 0.220$     $0.220 < \gamma < 5.728$      $\gamma < 0.033$     $0.033 < \gamma < 9.547$

0.360         0.360         Never                $0.000 < \gamma < 6.163$      Never                $0.000 < \gamma < 9.630$
0.360         0.400         $\gamma < 0.044$     $0.044 < \gamma < 5.728$      $\gamma < 0.008$     $0.008 < \gamma < 9.547$

0.400         0.400         Never                $0.000 < \gamma < 5.728$      Never                $0.000 < \gamma < 9.547$
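The cutoffs in Table 2 follow from comparing the three losses pairwise. The sketch below is an illustration under stated assumptions, not the authors' code: the names `region_width` and `gamma_cutoffs` are hypothetical, and, following the conjecture made in the text, the same reported rate 0.073 is used under options A and S, together with $P(\tilde{x}^S = 1) = 0.092$.

```python
def region_width(p_reported, lam, per_value):
    """Width of [0, 1] intersected with the interval in (7)/(9) or (8)/(10)."""
    if per_value:  # Assumptions 2 and 4
        lower = (p_reported - lam) / (1.0 - lam)
        upper = p_reported / (1.0 - lam)
    else:          # Assumptions 1 and 3
        lower = p_reported - lam
        upper = p_reported + lam
    return min(1.0, upper) - max(0.0, lower)


def gamma_cutoffs(p_y, p_x, lam_A, lam_S, per_value=False):
    """Return (a_max, s_min, s_max), assuming 0 < p_x < 1.

    Option A is chosen for gamma < a_max; option S for s_min < gamma < s_max.
    """
    d_A = region_width(p_y, lam_A, per_value)
    d_S = region_width(p_y, lam_S, per_value)
    a_over_s = (d_S - d_A) / (1.0 - p_x)  # gamma + d_A < gamma * p_x + d_S
    a_over_n = 1.0 - d_A                  # gamma + d_A < 1
    s_over_n = (1.0 - d_S) / p_x          # gamma * p_x + d_S < 1
    return min(a_over_s, a_over_n), a_over_s, s_over_n


# Values of Assumption 5 with the rates reported in the text.
for per_value in (False, True):
    a_max, s_min, s_max = gamma_cutoffs(0.073, 0.092, 0.15, 0.25, per_value)
    print(per_value, round(a_max, 4), round(s_max, 4))
```

With these inputs the sketch returns approximately 0.1101 and 7.3587 under Assumptions 1 and 3, and 0.0126 and 9.8116 under Assumptions 2 and 4, matching the inequalities displayed in the text. When $\lambda^A = \lambda^S$ the first cutoff is zero, which corresponds to the "Never" entries. Because the inputs here are rounded to three decimals, a recomputed cutoff for other rows may differ from the printed table in the last digit.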



We conclude this section by calling attention to the fact that the 1993 
wave of the Assets and Health Dynamics Among the Oldest Old (AHEAD) 
survey targeted a population similar in age to the NLSOM. The AHEAD 
survey also asked respondents about their limitations in ADLs, but it used 
neither design option A nor S. Instead, AHEAD omitted the opening broad
question of the NLSOM and immediately posed a series of specific questions 
to all respondents. The fraction of AHEAD respondents who reported limi- 
tations in bathing/showering was 0.085, a value close to that elicited in the 
NLSOM. To compare the AHEAD and NLSOM designs would require gen- 
eralization of the decision problem that we set up in Section 3. In particular, 
we would need to take into account the loss of information on limitations 
in ADLs that may potentially occur in AHEAD by dropping the opening 
question. 

6. Conclusion. Survey planners have long had to cope with the tension 
between the desire to reduce the costs and increase the informativeness of 
surveys. However, they have not studied questionnaire design as a formal
decision problem in which one uses an explicit loss function to quantify the
trade-off between cost and informativeness. Groves (1987) called attention to 
this in an article in Public Opinion Quarterly (POQ), writing (page S167): 

"The inextricable link between costs and errors rarely is formally acknowl- 
edged in methods articles in POQ, or in any other scholarly journal for that 
matter. That state of affairs has two detrimental effects: (1) methodologists
invent methods to reduce an error, but fail to measure the cost impact of the 
new idea, and (2) practitioners reject new ideas until it becomes clear that 
they result in reduced costs. Given the link between errors and costs, many 
new ideas require spending money to reduce an error." 

Groves went on to contrast the situation in questionnaire design with that 
in survey sampling, which has long used formal models of cost and sampling 
error to analyze the problem of choosing sample size. See also Spencer (1980, 
1985, 1994), who has argued broadly for benefit-cost analysis of programs 
of data collection, with particular attention to the U.S. Census. 

This paper has formally analyzed skip sequencing as a decision problem 
in questionnaire design. We have intentionally kept the exposition simple 
in order to highlight the basic trade-off between cost and informativeness 
in choosing a design option. Survey researchers and statisticians with tradi- 
tional training may be least familiar with our measurement of informative- 
ness by the size of the identification region for a population parameter of 
interest. Although identification is the central problem generated by nonre- 
sponse and response errors, the research literatures in survey research and 
statistics contain remarkably little formal analysis of identification. We think 
that the illustrative cases considered in Sections 4 and 5 give a construc- 
tive sense of how to proceed, without getting bogged down in mathematical 
detail. 

While identification is the dominant issue in assessing data quality in large 
surveys, sampling error can also be a significant concern in smaller surveys. 
A straightforward extension of our work to smaller surveys is to measure 
informativeness through a confidence interval for the partially identified 
parameter of interest. The literature on partial identification has recently 
spawned many approaches to the construction of asymptotically valid confi- 
dence intervals. See, for example, Imbens and Manski (2004), Chernozhukov, 
Hong and Tamer (2007) and Beresteanu and Molinari (2008). Another ap- 
proach, with a firmer decision-theoretic foundation, would be to address the 
questionnaire design problem from the perspective of Wald (1950). 

APPENDIX: MIXTURE MODEL AND MISCLASSIFICATION MODEL 

The mixture model of robust statistics introduces latent variables $e \in Y$
and $w \in \{0, 1\}$, and views the reported values $\tilde{y}$ as generated by the mixture
$\tilde{y} = wy + (1 - w)e$. The unobservable binary variable $w$ denotes whether $y$ or
$e$ is observed. Realizations of $\tilde{y}$ with $w = 1$ are said to be error free and those
with $w = 0$ are said to be data errors. By the Law of Total Probability, the
relationship between the observable distribution $P(\tilde{y})$ and the unobservable
distribution $P(y)$ is

$$P(\tilde{y}) = P(y \mid w = 1)P(w = 1) + P(e \mid w = 0)P(w = 0), \tag{11}$$

$$P(y) = P(y \mid w = 1)P(w = 1) + P(y \mid w = 0)P(w = 0). \tag{12}$$

The mixture model per se is a formalism without content. It becomes infor- 
mative when accompanied by assumption of an upper bound on the occur- 
rence of data errors, as follows: 

Assumption A.1. $P(w = 0) \leq \lambda < 1$.

It is sometimes also assumed that the occurrence of errors is statistically 
independent of the value of y. That is, 

Assumption A.2. $y \perp w$.

Horowitz and Manski (1995) studied the implications of the mixture
model for partial identification of probability distributions; see also Manski
(2003), Chapter 4. They derived the identification region for $P(y)$ and
for parameters of this distribution that respect stochastic dominance, under
Assumption A.1 alone and under Assumptions A.1 and A.2. They refer to
the first case as "corrupted sampling," and to the second as "contaminated
sampling."
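One direction of the connection to Assumption 1 of Section 5.1 can be read directly off the mixture representation. The following is a brief sketch of that single step only: whenever $w = 1$ the report coincides with the truth, so an upper bound $\lambda$ on $P(w = 0)$ yields a lower bound on the probability of coincidence.

```latex
\begin{align*}
\{w = 1\} \subseteq \{\tilde{y} = y\}
\quad\Longrightarrow\quad
P(\tilde{y} = y) \;\geq\; P(w = 1) \;=\; 1 - P(w = 0) \;\geq\; 1 - \lambda .
\end{align*}
```

The inclusion can be strict, because an erroneous draw $e$ may happen to equal $y$.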

The relationship between the mixture model and the misclassification
model can be easily established starting from equation (11). Observe that

$$P(\tilde{y} = j \mid y = k) = \begin{cases} P(w = 1 \mid y = k) + P(e = k \mid y = k, w = 0)\, P(w = 0 \mid y = k), & \text{if } j = k, \\ P(e = j \mid y = k, w = 0)\, P(w = 0 \mid y = k), & \text{if } j \neq k. \end{cases} \tag{13}$$

Hence, assumptions on $P(w \mid y)$ translate immediately into assumptions for
the misclassification model. Molinari (2003) shows that if the distribution
of $e$ is unrestricted, the mixture model with Assumptions A.1 and A.2 is
equivalent to the misclassification model with an assumption specifying a
common lower bound on the probabilities of correct report, $P(\tilde{y} = k \mid y = k)$,
$k \in Y$. The mixture model with Assumption A.1 alone is equivalent to the
misclassification model with an assumption specifying a lower bound on the
probability that $\tilde{y}$ and $y$ coincide, $P(\tilde{y} = y)$.